We introduce a simple theoretical approach for an equilibrium study of proteins with known native 
state structures. We test our approach with results on well-studied globular proteins, Chymotrypsin 
Inhibitor (2ci2), Barnase and the alpha spectrin SH3 domain, and present evidence for a hierarchical 
onset of order on lowering the temperature, with significant organization at the local level even at 
high temperatures. A further application to the folding process of HIV-1 protease shows that the 
model can be reliably used to identify key folding sites that are responsible for the development of 
drug resistance. 


Recent experimental and theoretical advances [1,2] have shown that the topology of the native structure of a protein 
plays an important role in determining many of its attributes. The number of distinct native state conformations of 
proteins is limited [3]: often several distinct sequences fold into the same native state structure. The native state 
structures of proteins contain secondary motifs (helices and sheets) in lower dimensional manifolds which are curled 
into neat patterns (somewhat analogous to the packing of clothes in a suitcase) and play a central role in the folding 
process [4]. 

The problem of protein folding entails the study of the non-equilibrium dynamics in a rugged free energy landscape 
[5]. A valuable starting point for attacking such a problem is a thorough equilibrium analysis of proteins with 
known native state structures. This would be useful for the determination of the folding transition temperature by, for 
example, monitoring the temperature at which thermodynamic quantities such as the specific heat show a peak. Such 
a study would lead to clear indications of the equilibrium conformations of proteins as a function of the temperature 
and provide a detailed picture of the onset of native-state like order on lowering the temperature through the folding 
transition temperature. Ideally, one would like information on the non-native contacts and their role in facilitating 
the onset of native-state ordering. Furthermore, it would be useful to incorporate amino-acid specific interactions 
whenever such information is available. 

At the present time, there is no simple theoretical framework for accomplishing all these objectives. Go-like models 
have proved to be useful for incorporating the role of the topology of the native state structure in the folding 
process. In such models, one ascribes a favorable attractive energy to the native contacts. There have been numerous 
studies of Go-like models, which capture the notion of minimal frustration [6], but even here, for realistic off-lattice 
calculations, it is hard, if not impossible, to carry out a full exploration of phase space in order to deduce equilibrium 
averages. As mentioned previously, the ruggedness of the free energy landscape confines the system to a small part of 
phase space, leading to non-ergodic behavior even for modest size proteins. Commonly used 
dynamics such as Monte Carlo or molecular dynamics tend to leave the system trapped in separate compartments of phase 
space because of barriers that are difficult to surmount as the temperature is lowered. 

Recent progress has been made in the development of physically motivated topology-based models |p|,p|-[12[ . The 
models vary greatly in complexity and analytic tractability. The only energy contributions are postulated to arise 
from the establishment of native interactions and therefore it is impossible to recover information on non-native 
contacts. In addition, a huge reduction in phase space is achieved by introducing suitable constraints on contiguous 
spin variables along the chain. As a consequence, the "true" hierarchical formation of the native state structure may 
be incompatible with the phase-space constraints. 

In this paper, we present a simple model for calculating the equilibrium properties of heteropolymers or proteins 
with known native states. The model builds on the importance of the native state topology by assigning an attractive 
interaction between nearby amino acids that are known to make native state contacts. It is also possible to incorporate 
amino acid specific interactions into the model. While the model, in its present simple form, does not accurately 
represent the effects of self-avoidance [13], it nevertheless ensures that the native state is the true ground state and 
satisfies all the steric constraints. The connectedness of the chain as well as its entropy are captured in a simple, but 
non-trivial, manner. The most significant advantage of the model is that it can be used to explore the equilibrium 
thermodynamics without being hampered by inaccurate or sluggish dynamics. A self-consistent approximation is used 
to reduce the model to a Gaussian [14] form. The latter lends itself to a straightforward determination of equilibrium 
quantities that identify the key folding sites, such as those targeted by drugs against viral enzymes [15],[16]. 

The conformation of a protein is specified by the location of the C$_\alpha$ atom of the i-th amino acid in sequence, $\vec r_i$. 
In the native state, $\vec r_i = \vec r_i^{\,0}$. A simple effective Hamiltonian (energy function) that captures the essential features of 
proteins is: 

$$\mathcal{H} = \frac{T}{2} K \sum_{i=1}^{N-1} \left(\vec r_{i,i+1} - \vec r^{\,0}_{i,i+1}\right)^2 + \sum_{i,j} \frac{\Delta_{ij}}{2}\, \chi_{ij}\, \theta[-\chi_{ij}], \tag{1}$$

where $\vec r_{ij} = \vec r_i - \vec r_j$, $T$ is the temperature ($k_B = 1$), $\theta$ is the step function, 

$$\theta(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{otherwise}, \end{cases} \tag{2}$$

$\Delta$ is the contact matrix, whose element $\Delta_{ij}$ is 1 if residues i and j are in contact in the native state (i.e. their C$_\alpha$ 
separation is below the cutoff c = 6.5 Å) [17] and 0 otherwise, and 

$$\chi_{ij} = \left|\vec r_{ij} - \vec r^{\,0}_{ij}\right|^2 - R^2. \tag{3}$$
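The definitions of the contact matrix and of the well variable in Eq. (3) can be illustrated with a minimal Python sketch. The function names `contact_matrix` and `chi`, the use of NumPy, the choice of a zero diagonal for the contact matrix, and the four-bead toy structure are assumptions made for illustration only; the cutoff of 6.5 Å and R = 3 Å are the values quoted in the text.

```python
import numpy as np

def contact_matrix(native_coords, cutoff=6.5):
    """Return the 0/1 contact matrix for native C-alpha coordinates (N x 3, in Angstrom)."""
    r0 = np.asarray(native_coords, dtype=float)
    # Pairwise C-alpha distances in the native structure.
    dist = np.linalg.norm(r0[:, None, :] - r0[None, :, :], axis=-1)
    delta = (dist < cutoff).astype(int)
    np.fill_diagonal(delta, 0)  # assumption: a residue is not counted as its own contact
    return delta

def chi(coords, native_coords, i, j, R=3.0):
    """Well variable |r_ij - r0_ij|^2 - R^2; it is negative inside the harmonic well."""
    r = np.asarray(coords, dtype=float)
    r0 = np.asarray(native_coords, dtype=float)
    dev = (r[i] - r[j]) - (r0[i] - r0[j])
    return float(dev @ dev) - R ** 2

# Hypothetical toy "native state": four beads on a line, 3.8 Angstrom apart.
toy_native = np.array([[3.8 * k, 0.0, 0.0] for k in range(4)])
toy_delta = contact_matrix(toy_native)
```

In this toy chain only sequence neighbours fall within the cutoff, and the well variable of any pair takes its minimum value, -R^2, in the native conformation.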

The first term in the Hamiltonian involves harmonic interactions between successive "beads" in the chain [13]. The 
temperature factor ensures that the free-energy contributions of the peptide chain are constant over the range of 
temperatures relevant to the folding process. The second term in (1) provides an attractive interaction when amino 
acids which are in contact in the native state conformation are in the proximity of each other [7]. Furthermore, (1) 
guarantees that a specific native target structure is the ground state among all possible three dimensional structures. 
It is straightforward to generalize the Hamiltonian so that the pairwise interactions do not all have the same strength 
but reflect the amino acid-specific interactions. In standard off-lattice approaches, the interaction 
between non-bonded amino acids at a distance d is taken to be a square well potential, or some type of Lennard-Jones 
interaction. Our choice in Eq. (1) is a sort of "harmonic well" which, while being physically sound and viable, is 
suitable for a self-consistent treatment, as explained below. The location of the outer rim of the well is controlled by 
R, which can be set to a few Angstroms (R = 3 Å in the present study) to reflect the fact that, when the separation 
of two residues is appreciably different from the native one, their interaction is negligible. In its present form, the 
model is complex and not amenable to a simple attack. While the first term has a simple quadratic form, the second 
term is difficult to deal with because of the step function. 
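To make the structure of the Hamiltonian concrete, the following Python sketch evaluates the two terms of Eq. (1) for a given conformation. It is only an illustration under stated assumptions: the function name `energy`, the NumPy implementation, the zero diagonal of the contact matrix and the sum over ordered pairs (i, j) are choices made here, while K = 1/15 and R = 3 Å are the values used in the text.

```python
import numpy as np

def energy(coords, native_coords, delta, T, K=1.0 / 15.0, R=3.0):
    """Evaluate the chain term plus the harmonic-well term for one conformation."""
    r = np.asarray(coords, dtype=float)
    r0 = np.asarray(native_coords, dtype=float)
    delta = np.asarray(delta, dtype=float)  # assumed symmetric with zero diagonal
    # First term: harmonic springs between successive beads, scaled by T.
    bond_dev = (r[1:] - r[:-1]) - (r0[1:] - r0[:-1])
    chain = 0.5 * T * K * np.sum(bond_dev ** 2)
    # Second term: well variable for every ordered pair (i, j).
    dev = (r[:, None, :] - r[None, :, :]) - (r0[:, None, :] - r0[None, :, :])
    chi = np.sum(dev ** 2, axis=-1) - R ** 2
    # The step function keeps only pairs inside the well (chi < 0).
    well = 0.5 * np.sum(delta * chi * (chi < 0))
    return float(chain + well)

# Hypothetical toy data: four beads on a line with nearest-neighbour native contacts.
toy_native = np.array([[3.8 * k, 0.0, 0.0] for k in range(4)])
toy_delta = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
```

Since the chain term is non-negative and each well contribution is bounded below by -R^2, no conformation can have a lower energy than the native one, which is the ground-state property stated above.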

The key observation is that a dramatic simplification is accomplished on making a self-consistent Gaussian 
approximation within which the partition function is $Z = \int \prod_i d^3 r_i \, e^{-\beta \mathcal{H}_0}$, with 

$$\mathcal{H}_0 = \frac{T}{2} K \sum_{i=1}^{N-1} \left(\vec r_{i,i+1} - \vec r^{\,0}_{i,i+1}\right)^2 + \sum_{i,j} \frac{\Delta_{ij}}{2}\, p_{ij}\, \chi_{ij}, \tag{4}$$

where the $p_{ij}$'s are determined self-consistently, in a spirit similar to a local mean field approximation: 

$$p_{ij} = \left\langle \theta\!\left(R^2 - (\vec r_{ij} - \vec r^{\,0}_{ij})^2\right) \right\rangle_{\mathcal{H}_0}. \tag{5}$$

Physically, $p_{ij}$ represents the equilibrium probability of the formation of the i-j contact at temperature T. $p_{i,i+1}$ 
can be conveniently frozen to 1 to reflect the strength of the peptide chain. With the functional form given in (4), 
the partition function can be readily deduced because all integrals are Gaussian. The free energy, $F = -T \ln Z$, is: 

$$F = -\frac{3NT}{2} \ln(2\pi) - \frac{3T}{2} \ln(\det M) - \frac{R^2}{2} \sum_{l,m} \Delta_{lm}\, p_{lm}, \tag{6}$$
where the inverse matrix $M^{-1}$ is defined as 

$$M^{-1}_{ij} = \begin{cases} K(2 - \delta_{i,1} - \delta_{i,N}) + 2 \sum_l \Delta_{i,l}\, p_{i,l}/T & \text{for } i = j, \\ -2\, p_{i,j}\, \Delta_{i,j}/T + K\left[-\delta_{i,j+1} - \delta_{i,j-1}\right] & \text{for } i \neq j. \end{cases} \tag{7}$$
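The matrix of Eq. (7) has the structure of a spring-network (graph Laplacian) matrix, which the following Python sketch makes explicit. The function name `inverse_M`, the NumPy implementation and the exclusion of the diagonal of the contact matrix are assumptions made for illustration; they are not taken from the original work.

```python
import numpy as np

def inverse_M(delta, p, T, K=1.0 / 15.0):
    """Assemble the inverse matrix of Eq. (7) from contact probabilities p."""
    delta = np.asarray(delta, dtype=float)
    p = np.asarray(p, dtype=float)
    n = delta.shape[0]
    # Contact springs 2 * Delta_ij * p_ij / T (diagonal excluded by assumption).
    w = 2.0 * delta * p / T
    np.fill_diagonal(w, 0.0)
    # Chain springs of strength K between sequence neighbours.
    chain = K * (np.eye(n, k=1) + np.eye(n, k=-1))
    springs = w + chain
    # Diagonal: total spring constant attached to each site.
    # Off-diagonal: minus the spring constant joining the two sites.
    return np.diag(springs.sum(axis=1)) - springs
```

Every row of the resulting matrix sums to zero, reflecting the invariance of the energy under a rigid translation of the chain. Any numerical inversion therefore has to treat this zero mode explicitly, for instance with a pseudo-inverse or by fixing the centre of mass; how this was handled in the original calculation is not stated in the text.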

The self-consistency relations for the probabilities $p_{ij} = \langle \theta_{ij} \rangle_{\mathcal{H}_0}$ are satisfied by finding the fixed point of the 
recursion equation: 

$$p'_{ij} = (2\pi G_{ij})^{-3/2} \int d^3 r \; e^{-r^2/(2 G_{ij})}\, \theta(R^2 - r^2), \tag{8}$$

where the right-hand side depends on the $p_{ij}$ obtained at the previous iteration through the matrix $M$, which enters 
the definition of $G_{ij} = M_{ii} + M_{jj} - M_{ij} - M_{ji}$. The solution of the recursion equations (8) for the $p_{ij}$ entails 
the evaluation of incomplete Γ functions. The iteration converges to the fixed point very fast: typically an accuracy of $10^{-4}$ 
is reached in a few dozen iterations. Thus the Gaussian nature of the Hamiltonian allows a straightforward analytic 
attack on the problem which, when combined with a rapidly convergent iterative procedure on a computer, allows 
one to determine the equilibrium properties of any protein with ease. 
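The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration under explicit assumptions, not the original implementation: the function names, the starting guess with all native contacts formed, the plain fixed-point update and the use of a pseudo-inverse to remove the translational zero mode are all choices made here. The integral in Eq. (8) is evaluated in closed form as the probability that an isotropic three-dimensional Gaussian variable with variance G per component has length below R, which is equivalent to a regularized incomplete Γ function of order 3/2.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def well_probability(G, R=3.0):
    """Closed form of Eq. (8): P(|x| < R) for a 3D Gaussian with variance G per component."""
    if G <= 0.0:
        return 1.0
    a = R / sqrt(2.0 * G)
    return erf(a) - 2.0 * a * exp(-a * a) / sqrt(pi)

def solve_contact_probabilities(delta, T, K=1.0 / 15.0, R=3.0, tol=1e-4, max_iter=1000):
    """Iterate Eqs. (7) and (8) until the contact probabilities stop changing."""
    delta = np.array(delta, dtype=float)
    np.fill_diagonal(delta, 0.0)  # assumption: no self-contacts
    n = delta.shape[0]
    idx = np.arange(n)
    bonded = np.abs(idx[:, None] - idx[None, :]) == 1
    chain = K * bonded.astype(float)
    prob = np.vectorize(well_probability, otypes=[float])
    p = delta.copy()  # assumed starting guess: all native contacts formed
    for _ in range(max_iter):
        springs = 2.0 * delta * p / T + chain
        m_inv = np.diag(springs.sum(axis=1)) - springs  # Eq. (7)
        m = np.linalg.pinv(m_inv)  # assumption: pseudo-inverse removes the zero mode
        d = np.diag(m)
        G = d[:, None] + d[None, :] - m - m.T  # G_ij = M_ii + M_jj - M_ij - M_ji
        p_new = np.where(delta > 0, prob(G, R), 0.0)
        p_new[bonded & (delta > 0)] = 1.0  # p_{i,i+1} frozen to 1
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Hypothetical three-residue example in which all pairs are native contacts.
toy_delta = np.ones((3, 3)) - np.eye(3)
p_cold = solve_contact_probabilities(toy_delta, T=0.5)
p_hot = solve_contact_probabilities(toy_delta, T=50.0)
```

In this toy example the non-bonded contact between the first and last residue is essentially fully formed at the lower temperature and largely broken at the higher one, while the bonded pairs stay frozen at 1.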

We present here the results for the globular proteins 2ci2, Barnase and the α-spectrin SH3 domain (PDB codes: 2ci2, 
1a2p and 1shg, respectively) for the simple case of uniform attractive interactions between the amino acids which form 
the native contacts. In all cases, it is straightforward to determine various thermodynamic quantities as a function of 
temperature and identify [19] the folding transition temperature as the one at which the specific heat exhibits a maximum 
(see e.g. Figure 1). The specific heat peak at the folding transition in Figure 1 is significantly broader than in 
experiment [20] and theory [21]. The cooperativity of the model can be easily controlled by adjusting 
the value of K, with smaller values leading to sharper "transitions". In addition, a change of K also affects 
the average amount of native structure that is formed at the native state. Because we are particularly interested 
in characterizing the progressive formation of native interactions, we chose the strength of the peptide chain, K, by 
inspecting the behaviour of the fraction of native contacts, Q, as a function of temperature. Q, which is often termed 
"native-state overlap", is defined as 






$$Q = \frac{\sum'_{i,j} \Delta_{ij}\, p_{ij}}{\sum'_{i,j} \Delta_{ij}}, \tag{9}$$



where the prime denotes that the sum is not carried out over consecutive pairs, in order to exclude the effects of 
the peptide bond. We find that, almost irrespective of the length of the target proteins, a value of K = 1/15 yields 
Q ≈ 0.5 at the folding transition, consistent with experimental findings [22] and previous observations [9,11,16],[21]. 
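The two global observables used in this paragraph can be computed as in the following Python sketch. It assumes that the native-state overlap is the average of the contact probabilities over native contacts, excluding consecutive pairs, and that the specific heat is obtained from the free energy through the standard relation C = -T d²F/dT², evaluated here by finite differences on a temperature grid. The function names and the numerical differentiation are illustrative choices, not the procedure of the original work.

```python
import numpy as np

def native_overlap(delta, p):
    """Fraction of native contacts formed, excluding consecutive pairs (i, i+1)."""
    delta = np.asarray(delta, dtype=float)
    p = np.asarray(p, dtype=float)
    i, j = np.triu_indices(delta.shape[0], k=2)  # pairs with j - i >= 2
    return float(np.sum(delta[i, j] * p[i, j]) / np.sum(delta[i, j]))

def specific_heat(temps, free_energies):
    """C = -T d^2F/dT^2 estimated by finite differences on a temperature grid."""
    t = np.asarray(temps, dtype=float)
    f = np.asarray(free_energies, dtype=float)
    first = np.gradient(f, t)
    second = np.gradient(first, t)
    return -t * second

def folding_temperature(temps, free_energies):
    """Temperature of the specific heat maximum, ignoring the grid edges."""
    t = np.asarray(temps, dtype=float)
    c = specific_heat(t, free_energies)
    interior = slice(2, -2)  # one-sided differences at the edges are less accurate
    return float(t[interior][np.argmax(c[interior])])
```

Given a table of free energies F(T) obtained from Eq. (6) on a sufficiently fine temperature grid, `folding_temperature` returns the location of the specific heat peak, and `native_overlap` evaluated at that temperature gives the value of Q to be compared with 0.5.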



While Q is a good global parameter to characterize the progress towards the native state in a folding process, it is 
useful to monitor the onset of native ordering at the level of individual residues. The quantity $p_i$ 

provides an intuitive measure [24],[25] of the degree to which amino acid i is in its native-like conformation. Figure 2 
shows the profiles of such environments for each amino acid in proteins CI2, barnase and the α-spectrin SH3 
domain as a function of the temperature. 

In agreement with the experimental findings on these heavily investigated proteins [26]-[28], and also with theoretical 
predictions [29]-[31], we observe that the secondary structures form at relatively high temperatures and condition 
subsequent folding events. Note the significant lack of ordering of the loop regions even at the folding transition 
temperature. β-sheets are seen to form along one preferred direction, while the formation of α-helices occurs from 
the ends. In general, the tendency of a site to reach its native environment increases with both its degree of burial 
and the locality of the contacts it forms. The intricate details of the figure reflect an incremental assembly process 
and also the complex interplay between the two effects mentioned above. 

To further corroborate the validity of the proposed model in capturing the important folding steps we consider 
an application to an important enzyme, the protease of the HIV-1 virus (PDB code 1aid), which plays a vital role 
in the spreading of the viral infection. Through extensive clinical trials [15], it has been established that there is a 
well-defined set of sites in the enzyme that are crucial for developing, through suitable mutations, resistance against 
drugs and which play a crucial role in the folding process [16]. 

To identify the key folding sites we looked for contacts that are significantly formed above the folding transition 
temperature. Quantitatively, we define the formation temperature of a contact as the one at which a given $p_{ij}$ takes 
on the value 0.5. Our results are summarized in Fig. 3 (upper-left triangular region). As a comparison, in the 
lower-right triangular region of the same figure, we have highlighted the contacts involving the known mutating sites. 
Remarkably, among the top 20 contacts with the largest formation temperature, 10 involve one (or more) 
known mutating sites. A straightforward calculation shows that the probability of observing this many successful 
matches by picking contacts at random is less than 2%, confirming that our model captures important aspects of the 
folding process with remarkable precision and reliability. 
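The two quantitative steps of this analysis can be illustrated with a short Python sketch. The first function locates the formation temperature of a contact by linear interpolation of its probability curve; the second evaluates the chance of obtaining at least a given number of matches when contacts are picked at random, modelled here as drawing without replacement (a hypergeometric tail). The function names, the interpolation scheme and the hypergeometric model are assumptions made for illustration, and since the total number of native contacts and the number of contacts involving mutating sites are not quoted in the text, no attempt is made to reproduce the 2% figure.

```python
from math import comb

def formation_temperature(temps, probs, level=0.5):
    """Lowest temperature at which a contact probability curve crosses `level`."""
    points = sorted(zip(temps, probs))
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if p0 != p1 and min(p0, p1) <= level <= max(p0, p1):
            # Linear interpolation between the two bracketing grid points.
            return t0 + (level - p0) * (t1 - t0) / (p1 - p0)
    return None  # the curve never reaches the requested level on this grid

def prob_at_least(k, n_drawn, n_marked, n_total):
    """P(at least k marked items among n_drawn items drawn without replacement)."""
    upper = min(n_drawn, n_marked)
    favourable = sum(
        comb(n_marked, x) * comb(n_total - n_marked, n_drawn - x)
        for x in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_drawn)
```

With the actual counts for the HIV-1 protease contact map, `prob_at_least(10, 20, n_marked, n_total)` would give the probability of finding 10 or more contacts involving mutating sites among 20 randomly chosen contacts.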



In summary, the self-consistent Gaussian approach provides a simple way of probing the equilibrium properties 
of proteins, with the incorporation of amino acid specific interactions (when known), free from the usual 
complications associated with imperfect or inadequately studied dynamics. A key advantage is that our approach is 
essentially analytic and the quantities of interest may be determined with arbitrary accuracy quite easily. 

We are indebted to Alessandro Flammini for a careful reading of the manuscript. This work was supported by INFM, 
Murst Cofin99, and NASA. 

FIG. 1. Plot of the specific heat (in arbitrary units) and the native-state overlap as a function of temperature for the protein 
Chymotrypsin Inhibitor. The temperature is measured in units of the folding transition temperature (identified through the 
maximum of the specific heat). 




[Figure 2, panel (c): α-spectrin SH3; horizontal axis: site]




FIG. 2. Plot of $p_i$, the degree to which amino acid i is in a native-like conformation, versus i for (a) 2ci2, (b) Barnase and 
(c) α-spectrin SH3. In ascending order the curves are calculated at T = 2.0, 1.5, 1.0, 0.5 and 0.35 (measured in units of the 
folding transition temperature). The bar at the bottom shows the secondary structure associated with amino acid i. 

FIG. 3. Contact map of the HIV-1 PR monomer. Upper-left triangular region: contacts with a large [small] formation 
temperature are shown in dark [light] gray. Lower-right triangular region: contacts [not] involving one or more of the known key 
mutating sites are shown in dark [light] gray. 



